Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a faster string-search method with simd instructions than std::find #10858

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

skadilover
Copy link
Contributor

@skadilover skadilover commented Aug 27, 2024

Description

In practice, c_strstr is always faster than std::find in most cases, refer to [1]. However, the c_strstr interface will truncate '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way- search, see[2], but it does not seem to be exposed as a common interface. It requires some additional code to determine the best path. In some scenarios, the decision-making code may cause performance degradation.

Solution

There are many optimization algorithms for string-search, refer to [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions. For example, doris (the same is true for clickhouse, this part of the code of doris was copied from clickhouse, and the Volnitsky[11] part was removed), see [5].
This PR refers to the implementation of [5], [6], [8] :
[5] doris : first-two comapre first, and the issue of page-safe was also mentioned
[6] stringzilla: first-mid-lst compare first
[8] simd-strstr implement by WojciechMula, first-lst comapre first.

They all look few chars first, the more chars you choose to look the higher probability of hitting the optimized path, and the more cost of 'optimized path' itself.

In the end, the solution I chose is to first compare first-last-chars and then compare the remaining bytes. This is because in practice, our users seem to be able to get the matching results through first-last-chars.

ut & benchmark
This part of the code refers to folly’s test code, see [9] [10], of course, some adjustments have been made based on the code implementation.

Benchmark : std::find vs simdStrStr:

findSuccessful(opt_50_5) 31.66ms 31.59
findSuccessful(opt_100_10) 36.81ms 27.16
findSuccessful(opt_100_20) 96.47ms 10.37
findSuccessful(opt_1k_10) 63.27ms 15.81
findSuccessful(opt_1k_100) 156.95ms 6.37
findSuccessful(std_50_5) 45.24ms 22.11
findSuccessful(std_100_10) 191.23ms 5.23
findSuccessful(std_100_20) 322.11ms 3.10
findSuccessful(std_1k_10) 122.69ms 8.15
findSuccessful(std_1k_100) 168.10ms 5.95
findUnsuccessful(std_first_char_match) 819.13ms 1.22
findUnsuccessful(opt_first_char_match) 261.46ms 3.82
findUnsuccessful(std_first_char_unmatch) 249.12ms 4.01
findUnsuccessful(opt_first_char_unmatch) 199.02ms 5.02

References

[1] gcc std::search issue:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66414
[2] glibc strstr implements:
https://github.com/lattera/glibc/blob/master/string/strstr.c
[3] two way search:
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
[4] std::boyer_moore_searcher
https://en.cppreference.com/w/cpp/utility/functional/boyer_moore_searcher
https://github.com/mclow/search-library?tab=readme-ov-file
[5] doris string-search
https://github.com/apache/doris/blob/4e326e5f496109edb253d094869d057573ab7a59/be/src/vec/common/string_searcher.h#L44
[6] stringzilla
https://github.com/ashvardanian/StringZilla/blob/main/include/stringzilla/stringzilla.h#L2187
[7] string-search algorithms
https://www-igm.univ-mlv.fr/~lecroq/string/
[8] sse4 strst
https://github.com/WojciechMula/sse4-strstr
[9] folly benchmark
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTestBenchmarks.cpp.h
[10] folly ut
https://github.com/facebook/folly/blob/ce5edfb9b08ead9e78cb46879e7b9499861f7cd2/folly/test/FBStringTest.cpp#L1277
[11] clickhouse Volnitsky
https://github.com/ClickHouse/ClickHouse/blob/a51f867ce014a4cb8f97c46e004b90f5feafd80f/src/Common/Volnitsky.h#L482

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 27, 2024
Copy link

netlify bot commented Aug 27, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit cb6e208
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/670492291a31d70009910cfc

@Yuhta Yuhta self-requested a review August 27, 2024 15:16
@skadilover skadilover force-pushed the opt_strstr branch 7 times, most recently from e99e699 to 457860e Compare August 29, 2024 02:30
@skadilover skadilover changed the title Add a fast simd strstr Add a fast string-search method named simdStrStr Aug 29, 2024
@skadilover skadilover changed the title Add a fast string-search method named simdStrStr Add a fast string-search method with simd instructions Aug 29, 2024
@skadilover
Copy link
Contributor Author

skadilover commented Aug 29, 2024

@Yuhta
ping~

from #10731

I don't think you can put variable definition in header file, you probably need to move it to SimdUtil.cpp and have only forward declaration here. Also what is the value if we do MADV_HUGEPAGE for some addresses?

I found that the performance of the code will be reduced by 30% when placed in the cpp file. So for the case, I change the code like this :

  static const int kPageSize = sysconf(_SC_PAGESIZE);

====>
FOLLY_ALWAYS_INLINE bool pageSafe(const void* const ptr, size_t length) {
  static const int kPageSize = sysconf(_SC_PAGESIZE);
  return ((kPageSize - 1) & reinterpret_cast<std::uintptr_t>(ptr)) <=
      kPageSize - CharVector::size - length;
}

MADV_HUGEPAGE for some addresses

I haven't found a good way to handle this situation, could you give me some advice?
Could we assume MADV_HUGEPAGE is always times of _SC_PAGESIZE , 2M vs 4k etc. see https://hugekernel.org/doc/Documentation/admin-guide/mm/transhuge.rst

Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc)

RedHat define the valid values:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-transhuge#s-memory-configure_hugepages

Defines the size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.

@skadilover skadilover changed the title Add a fast string-search method with simd instructions Add a faster string-search method with simd instructions than std::find Aug 29, 2024
@skadilover skadilover force-pushed the opt_strstr branch 3 times, most recently from 991be1a to 54422e3 Compare August 30, 2024 02:14
Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the other suggestions in #10731 (review)

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
@skadilover
Copy link
Contributor Author

skadilover commented Sep 9, 2024

@Yuhta

Please consider the other suggestions in #10731 (review)

Updates code for the suggestions in #10731:

These checks can be VELOX_DCHECK since their correctness can be ensured quite locally in the same file

done

VELOX_SIMD_STRSTR_CASE

done

aside: Wondering what is the minimal k to make KMP faster than this. This is also related to content of the strings though.

I have extended StringBenchmark to support different algrithm, and kmp is always slower than simdStrStr(maybe my implement is not good so I add std::boyer_moore_searcher for comparison):

findSuccessful(simd_50_to_5)                              344.39ms      2.90
findSuccessful(std_50_to_5)                               802.61ms      1.25
findSuccessful(std_boyer_moore_50_to_5)                      1.42s   703.66m
findSuccessful(kmp_50_to_5)                                  5.23s   191.05m
findSuccessful(simd_100_to_10)                            315.82ms      3.17
findSuccessful(std_100_to_10)                                1.26s   790.97m
findSuccessful(std_boyer_moore_100_to_10)                 984.90ms      1.02
findSuccessful(kmp_100_to_10)                                6.38s   156.70m
findSuccessful(simd_100_to_20)                            352.90ms      2.83
findSuccessful(std_100_to_20)                                2.03s   492.18m
findSuccessful(std_boyer_moore_100_to_20)                 887.87ms      1.13
findSuccessful(kmp_100_to_20)                                6.16s   162.34m
findSuccessful(simd_1000_to_10)                           254.53ms      3.93
findSuccessful(std_1000_to_10)                            698.50ms      1.43
findSuccessful(std_boyer_moore_1000_to_10)                683.83ms      1.46
findSuccessful(kmp_1000_to_10)                               3.33s   300.39m
findSuccessful(simd_1000_to_100)                           65.42ms     15.29
findSuccessful(std_1000_to_100)                           186.01ms      5.38
findSuccessful(std_boyer_moore_1000_to_100)               381.65ms      2.62
findSuccessful(kmp_1000_to_100)                           913.63ms      1.09
findSuccessful(simd_1000_to_200)                          306.99ms      3.26
findSuccessful(std_1000_to_200)                           642.04ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               683.16ms      1.46
findSuccessful(kmp_1000_to_200)                              9.70s   103.05m
findUnsuccessful(std_first_char_match)                    400.47ms      2.50
findUnsuccessful(opt_first_char_match)                    266.45ms      3.75
findUnsuccessful(std_first_char_unmatch)                  333.54ms      3.00
findUnsuccessful(opt_first_char_unmatch)                  284.63ms      3.51
simd strstr std find std boyer moore kmp
heystack :50 needle :5 344ms 802ms 1.42s 5.23s
heystack : 100 needle :10 315ms 1.26s 984ms 6.38s
heystack :100 needle :20 449ms 2.03s 887ms 1.13s
heystack :1000 needle :10 254ms 689ms 693ms 3.3s
heystack :1000 needle :100 65ms 186ms 381ms 913ms
heystack :1000 needle :200 306ms 642ms 683ms 9.7s

Since this is an almost drop-in replacement for strstr, can you add another test to generate a few thousands of random string pairs with various lengths, and check our result is the same as strstr? Also make sure you cover corner cases like larger needle than haystack, etc.

I update ut refer to folly uts, add random test cases.

Could you have look at this once more?

@skadilover
Copy link
Contributor Author

skadilover commented Sep 9, 2024

@Yuhta ping~

Copy link
Contributor

@Yuhta Yuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark result looks very good

velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Outdated Show resolved Hide resolved
velox/common/base/SimdUtil-inl.h Show resolved Hide resolved
@skadilover
Copy link
Contributor Author

skadilover commented Sep 13, 2024

@Yuhta update code

  1. pageSize => global variable
  2. fix comments above except this:

Will we miss the opportunity if it is a super long string that is spanning multiple pages?

As we inline kPageSize, pageSafe method is faster than before, I move the check to each batch iteration, we only miss the comparison-span-pages case, maybe better, like clickhouse:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/StringSearcher.h

            while (haystack < haystack_end)
            {
                if (haystack + N <= haystack_end && isPageSafe(haystack))
                {

new benchmark result:

findSuccessful(simd_50_to_5)                              355.40ms      2.81
findSuccessful(std_50_to_5)                               807.73ms      1.24
findSuccessful(std_boyer_moore_50_to_5)                      1.40s   712.05m
findSuccessful(kmp_50_to_5)                                  5.18s   192.87m
findSuccessful(simd_100_to_10)                            319.76ms      3.13
findSuccessful(std_100_to_10)                                1.25s   802.85m
findSuccessful(std_boyer_moore_100_to_10)                 910.54ms      1.10
findSuccessful(kmp_100_to_10)                                6.38s   156.72m
findSuccessful(simd_100_to_20)                            355.43ms      2.81
findSuccessful(std_100_to_20)                                2.07s   483.83m
findSuccessful(std_boyer_moore_100_to_20)                 901.23ms      1.11
findSuccessful(kmp_100_to_20)                                6.20s   161.18m
findSuccessful(simd_1000_to_10)                           245.61ms      4.07
findSuccessful(std_1000_to_10)                            724.69ms      1.38
findSuccessful(std_boyer_moore_1000_to_10)                870.93ms      1.15
findSuccessful(kmp_1000_to_10)                               3.33s   300.34m
findSuccessful(simd_1000_to_100)                           69.67ms     14.35
findSuccessful(std_1000_to_100)                           186.06ms      5.37
findSuccessful(std_boyer_moore_1000_to_100)               286.45ms      3.49
findSuccessful(kmp_1000_to_100)                           913.97ms      1.09
findSuccessful(simd_1000_to_200)                          354.15ms      2.82
findSuccessful(std_1000_to_200)                           642.50ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               675.70ms      1.48
findSuccessful(kmp_1000_to_200)                              9.78s   102.28m
findUnsuccessful(std_first_char_match)                    920.16ms      1.09
findUnsuccessful(opt_first_char_match)                    383.93ms      2.60
findUnsuccessful(std_first_char_unmatch)                  334.28ms      2.99
findUnsuccessful(opt_first_char_unmatch)                  331.98ms      3.01

@skadilover
Copy link
Contributor Author

@Yuhta ping~

// may not generate a general-protection exception (#GP) in this situation,
// and the address that spans the end of the segment may or may not wrap
// around to the beginning of the segment.
for (; i <= n - needleSize && pageSafe(s + i) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, we will drop out of this fast path as soon as [s + i, s + i + needleSize) across the page boundary? So data in the subsequent pages would still go through slow path, even they are not crossing page boundary anymore?

Copy link
Contributor Author

@skadilover skadilover Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So data in the subsequent pages would still go through slow path, even they are not crossing page boundary anymore

yes , your understand correctly, this way maybe better than that put check outside because we could still check part of data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think we can still use the fast path for the second, and any subsequent pages? Why do we need to drop out early?

Copy link
Contributor Author

@skadilover skadilover Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let`s detail the case here:

image

ref : Computer Systems: A Programmer's Perspective section 9.3

Four virtual pages (VP 1, VP 2, VP 4, and VP 7)
are currently cached in DRAM. Two pages (VP 0 and VP 5) have not yet been
allocated, and the rest (VP 3 and VP 6) have been allocated, but are not currently
cached.

I think we only need to avoid to load data at the end of VP 4 because VP 5 is not allocated yet, so we only need to check if the end of input string array is pageSafe

Will we miss the opportunity if it is a super long string that is spanning multiple pages?

And we could only check the lst pointer in substring-search:

    if (i + needleSize + CharVector::size > n &&
        !pageSafe(s + i + needleSize - 1)) {
      break;
    }

I think we only miss last page in the case that the end of input string is right at the end of VP4 by this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yuhta update comment and code .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's very clear now, thanks

@skadilover
Copy link
Contributor Author

@Yuhta

Any more comments about this ?

@skadilover skadilover force-pushed the opt_strstr branch 2 times, most recently from 7aa4d26 to 446f09d Compare September 26, 2024 03:04
// Assume that the input string is allocated on virtual pages : VP1, VP2,
// VP3 and VP4 has not been allocated yet, we need to check the end of input
// string is page-safe to over-read CharVector.
if (i + needleSize + CharVector::size > n &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition can be loosen to i + needleSize - 1 + CharVector::size > n. Maybe this is more clear:

const auto last = i + needleSize - 1;
if (last + CharVector::size > n && !pageSafe(s + last)) {
  break;
}
// ...
auto blockLast = CharVector::load_unaligned(s + last);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants